Comprehensive Factors for Predicting the Complications of Diabetes Mellitus: A Systematic Review

Background: This article focuses on extracting a standard feature set for predicting the complications of diabetes mellitus by systematically reviewing the literature. It is conducted and reported following the guidelines of PRISMA, a well-known method for systematic reviews and meta-analyses. The research articles included in this study were retrieved using the search engine "Web of Science" over eight years. The most common complications of diabetes (diabetic neuropathy, retinopathy, nephropathy, and cardiovascular diseases) are considered in the study.
Methods: The features used to predict the complications are identified and categorised by scrutinising the standards of electronic health records.
Results: Overall, 102 research articles have been reviewed, resulting in 59 frequent features being identified. Nineteen attributes are recognised as standard across all four considered complications: age, gender, ethnicity, weight, height, BMI, smoking history, HbA1c, SBP, eGFR, DBP, HDL, LDL, total cholesterol, triglyceride, use of insulin, duration of diabetes, family history of CVD, and family history of diabetes. A well-accepted and up-to-date feature set for health analytics models that predict the complications of diabetes mellitus is a vital and contemporary requirement. A widely accepted feature set is also beneficial for benchmarking the risk factors of complications of diabetes.
Conclusion: This study is a thorough literature review that provides a clear state of the art for academicians, clinicians, and other stakeholders regarding the risk factors and their importance.


INTRODUCTION
The high prevalence of diabetes mellitus (DM) and its complications is leading to a severe global health burden [1,2]. Due to the high demand for expert knowledge, the soaring expenses in the healthcare sector, and the high cost of diagnostic tests and equipment, developing health decision-making systems through machine learning (ML) techniques has gained much attention in assisting disease diagnosis and prediction [3-5]. Classification, time-series analysis, clustering, and predictive analysis are well-known ML techniques for designing healthcare systems that categorise datasets, analyse and find patterns over a period, and forecast the future. Although several acute and chronic complications are consequences of DM, diabetic retinopathy (DR), diabetic neuropathy (DNeu), diabetic nephropathy (DNeph), and cardiovascular diseases (CVD) can be considered the most frequent and severe complications [6]. The potential repercussions and irreversibility of these complications of diabetes mellitus (CoDM) have led to the development of risk prediction/scoring systems.
*Address correspondence to this author at the University of Waikato, Hamilton, New Zealand; E-mail: william.wang@waikato.ac.nz
Predicting diabetes and its complications using ML techniques is evolving as a prominent research area due to its potential health outcomes and its role in personalised medicine. The cost-effectiveness, utilisation of expert knowledge, and enhancement of health indices offered by ML-aided systems have made them more popular in the digital healthcare industry. Feature selection is one of the vital steps that the model designer should perform while developing risk prediction/scoring models. However, predicting the risk of CoDM is challenging due to confounders and the lack of consensus on a standard feature set [4,7]. Therefore, this study aims to assist the feature selection phase of model development by creating the most frequently used feature sets for each CoDM through a systematic review conducted according to a well-known scientific method, PRISMA. Systematic reviews are commonly used for summarising research findings in health care [8]. Stakeholders in healthcare
2. Briefly describe the highly prioritised frequent factors.
The research method and materials are included in section 2, with a description of the process of article selection and feature extraction. The result and discussion sections include the extracted feature set and brief descriptions of frequent features. Research implications are included in section 4, and section 5 is the article's conclusion. Frequently used abbreviations in the article are listed below:

RESEARCH MATERIALS AND METHODS
This study has been performed according to PRISMA guidelines [9], a widely accepted scientific framework for reporting systematic reviews and meta-analyses. The four main PRISMA steps are identification, screening, eligibility, and inclusion. In the identification phase, researchers report the number of records identified through database searching and other sources. In the screening phase, duplicate records are removed, and the number of records screened and the number excluded are reported. Studies should be filtered by excluding irrelevant articles through proper eligibility criteria. In the eligibility phase, the number of full-text articles assessed for eligibility and the number excluded should be reported with reasons. Finally, the number of studies included in the systematic review should be specified at the inclusion phase [9]. The repositories used for article selection, the methods used to extract the most relevant articles, the exclusion criteria, and the reporting of results are key elements of a sound and scientific systematic review.
The current systematic review adopts the guidelines of the PRISMA method to conduct and report the research results. Research articles were identified through a well-known article repository, "Web of Science" [12], which provides a consistent search interface to multiple databases of academic journals, conference proceedings, letters, and other related publications in various disciplines. The articles were identified using the following structured search query:
(("risk*" or "risk model*" or "risk assess*" or "risk equation*" or "risk predict*") AND ("Diabetes*" or "Complication of diabetes" or "Complications of diabetes" or "comorbidities of diabetes*" or "comorbidity of diabetes*" or "diabetic*") AND ("Statistical model*" or "Regression model" or "Cox*" or "Artificial Intelligence*" or "*model*" or "Time series analysis*" or "Machine Learning" or "Time series Forecasting"))
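As an illustration only, the boolean search string reported above can be assembled from its three term groups. The sketch below is not the authors' tooling; the names `or_group` and `build_query` are hypothetical helpers introduced here, and the term lists are copied from the query in the text.

```python
# Term groups copied from the search query reported in the text.
RISK_TERMS = ["risk*", "risk model*", "risk assess*", "risk equation*", "risk predict*"]
DIABETES_TERMS = [
    "Diabetes*", "Complication of diabetes", "Complications of diabetes",
    "comorbidities of diabetes*", "comorbidity of diabetes*", "diabetic*",
]
MODEL_TERMS = [
    "Statistical model*", "Regression model", "Cox*", "Artificial Intelligence*",
    "*model*", "Time series analysis*", "Machine Learning", "Time series Forecasting",
]


def or_group(terms):
    """Quote each term and join the terms with 'or' inside parentheses."""
    return "(" + " or ".join(f'"{term}"' for term in terms) + ")"


def build_query(*groups):
    """Combine several or-groups with AND, wrapped in outer parentheses."""
    return "(" + " AND ".join(or_group(group) for group in groups) + ")"


# Hypothetical helper usage: rebuild the full search string.
query = build_query(RISK_TERMS, DIABETES_TERMS, MODEL_TERMS)
print(query)
```

Keeping the term groups as separate lists makes it easier to document, review, and rerun a search when a term is added or removed.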
The search was filtered by publication year, document type, accessibility, and publication journal. Research papers published within the last eight years (2015-2023) were considered in this study. The latest electronic search was performed on 1st July 2023. Reviews, proceedings papers, meeting abstracts, editorial materials, book chapters, letters, and news items were filtered out from the search results due to the possible uncertainties and inconsistencies that they might bring. Furthermore, only open-access articles were selected for this study. From the journals returned by the search query, the fifteen with the highest impact factor were chosen as top-ranked journals. The details of the journals can be found in the appendix (Table A1). The resulting articles were manually selected as eligible for the study by considering their relevance to the research topic. The relevance criteria focused on the aim of the research study, the presence of considered risk factors, and the type of features considered. The feature list for each complication was extracted by reviewing the features used in the selected studies. The selection criterion for a frequent feature is its presence in at least 20% of the selected articles. For example, to be selected as a frequent feature for DNeu, a feature should be present in at least 4 of the 20 selected articles (20%) in that category. The threshold count was therefore set separately for each complication, in proportion to its number of selected articles. The extracted frequent feature set was categorised using the USCDI standards, resulting in nine categories. Although lifestyle features are not available in EHRs, we included them as a separate category because of their importance in some research papers. The highly frequent feature list was created by selecting the features common to all complications, each with a frequency of 20% or more in every complication. The frequencies of the features were calculated as percentages to visualise their utilisation in each complication. Moreover, to address the study's second objective, the identified top eight features common to all four complications are described. Section 3.2 describes these features to provide a sound understanding of them, their usage in studies, and the significant factors about them.
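The selection rule described above (a feature is frequent if it appears in at least 20% of the articles selected for a complication, and highly frequent if it is frequent in every complication) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and every count is invented for the example except the figure of 20 DNeu articles, which comes from the text.

```python
def frequent_features(counts, n_articles, threshold_pct=20):
    """Return features used in at least threshold_pct percent of the articles."""
    # Integer arithmetic avoids floating-point rounding at the 20% boundary.
    return {name for name, c in counts.items() if c * 100 >= threshold_pct * n_articles}


def usage_percentages(counts, n_articles):
    """Express each feature count as a percentage of the selected articles."""
    return {name: round(100 * c / n_articles, 1) for name, c in counts.items()}


# HYPOTHETICAL data: (number of selected articles, feature -> number of articles).
# Only the DNeu article count (20) is taken from the text.
studies = {
    "DNeu": (20, {"age": 18, "gender": 17, "HbA1c": 12, "fibrates": 1}),
    "DR": (25, {"age": 20, "gender": 22, "HbA1c": 15, "retinal arterial calibre": 6}),
}

# Frequent features per complication, then those shared by all complications.
frequent = {name: frequent_features(counts, n) for name, (n, counts) in studies.items()}
common = set.intersection(*frequent.values())

print(frequent["DNeu"])
print(sorted(common))
```

With 20 articles the rule reduces to "present in at least 4 articles", matching the DNeu example in the text; the intersection step corresponds to building the feature list common to all complications.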

RESULTS AND DISCUSSION
The findings of the study are presented and discussed in this section. The identified set of features is presented in the first part of the section, while the selected persistent features are described later.

Risk Factors for Predicting Complications of Diabetes Mellitus
The selected features of the complications are illustrated with their percentages in Fig. (2). Among the chosen attributes, age, gender, ethnicity, weight, height, BMI, smoking history, HbA1c, SBP, eGFR, DBP, HDL, LDL, total cholesterol, triglyceride, use of insulin, duration of diabetes, family history of CVD, and family history of diabetes have been recognised as the feature subset that can be used in risk prediction of all four complications. According to the selected feature list, CVD has the largest number of features (n=29), while DR has the fewest (n=19). DNeu and DNeph each have 24 features. Age and gender remain the two most frequent features for DNeu and DR, while gender is the third-highest priority for CVD and DNeph. Although the absence of a percentage for a factor in one complication indicates that it is infrequent there, it does not mean that the factor is invalid for predicting that complication. For example, a percentage for the feature "urine albumin to creatinine ratio" is shown only for DNeph, but the feature has been used, less frequently, for all other complications. Some features have been extracted due to their frequency in one complication and are entirely unrelated to the other complications. "Retinal arterial calibre" is a feature that has been selected due to its frequency in DR and cannot be used anywhere else. "Renal disease requiring dialysis", "Myocardial infarction", and "Fibrates" are a few other features that have been selected based on one complication. Moreover, the terminology used in different articles varies hugely. Because the terms had to be extracted as they appear in the papers, some feature terms may overlap. For example, "Metformin" and "Oral medication for diabetes" are two feature values. Furthermore, the term "Antidiabetic medication" is also included under the medications of diabetes to preserve the original wording. The resulting plots of percentages of risk factors in each complication are included in Appendix Figs. (A1-A4).
The articles used for extracting the feature sets for the complications are tabulated in Table 1.
The columns of the table represent the complications, while the rows represent the selected features and their sub-categorisation. The total number of studies is given in each cell to emphasise the significance of the feature in each complication. The total number of studies used for each complication is given alongside the complication type in the table headers. Furthermore, Table 1 can be used as a reference for finding articles on predicting CoDM and for retrieving articles based on specific risk factors.

Introducing Highly Prioritised Risk Factors
This section describes the attributes selected as the most frequent features in predicting the selected complications. The top eight features identified in the study (age, gender, BMI, eGFR, SBP, HbA1c, smoking history, and DBP) are described to give a concise account of their usage in different studies. These eight features were chosen for description because of their high ranking and their potential importance in the literature.

Gender
Gender is ranked as the highest-priority feature in DR and remains in the top four positions for the other complications. According to the IDF, the prevalence of DM in women aged 20-79 (9.0%) is slightly lower than in men (9.6%) [105]. A retrospective cohort study based on clinical data in England found that men were diagnosed 2.6 years earlier than women, and that cardiovascular risk factor management was worse in women than in men [67]. Considering other factors, such as standardised BMI values, gender plays a vital role in predicting CoDM. Wright et al. concluded that CVD risk is higher in men than in women with T2DM [69]. Patients diagnosed with T2DM at a young age (18-45 years) were more frequently male than female [68]. Thus, there was a consistent association between CVD and the male sex [70]. However, the lipid profile of patients is not highly associated with gender [3]. No effect modification by sex has been found among patients with cardiovascular autonomic neuropathy (CAN) [71]. Furthermore, CVD and major adverse cardiovascular events (MACE) were not associated with sex [65,67]. Additionally, no significant association was found between sex and incident DPN [13]. Braffett et al. concluded that there is a weak association between sex and CAN [3]. Moreover, studies report gender values as female [1,72-74], as male [3,4,15,65,68,75-77], or as two categories, female and male [16,21,22,67,70,78,79].

BMI
BMI is calculated as a person's weight (kg) divided by the square of their height (m²) [68]. BMI and waist circumference, both related to obesity, are considered clinical inflammation factors for DR [35]. The prevalence of DM is higher in patients who are overweight or obese [66]. Obesity has been recognised as a risk factor for DPN [13,15]. However, "Evidence showing the association of obesity and DR remains equivocal" [35]. Furthermore, a significantly high mean BMI can be seen among patients with CAN [77]. Additionally, an inverse relationship has been shown between the age at diagnosis of T2DM and BMI [80]. BMI is eight units higher in patients diagnosed with T2DM at <41 years old than in those who developed T2DM in their 90s [80]. Thus, patients with an early diagnosis (18-45 years) have high mean BMI values [68]. The baseline value of BMI at the diagnosis of T2DM is higher in women than in men [67]. A high risk of clinically significant macular oedema (CSME) and ocular surgery was associated with increased BMI [5]. Although BMI is a well-established risk factor for predicting CVD among patients with T2DM, BMI or obesity did not emerge as a risk factor in some studies [65]. Well-established online risk assessment tools and their extensions, such as QRISK2 and QRISK3, use BMI as a decisive risk factor for predicting CVD [6]. Furthermore, BMI values are categorised in different ways by different authors. Table 3 presents some of the categorisations of BMI values in the literature.
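The BMI definition at the start of this section can be written as a short worked calculation. This is a minimal sketch; the function name and the example weight and height are illustrative assumptions, not values from any reviewed study.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / height_m ** 2


# Illustrative values only: a person weighing 70 kg with a height of 1.75 m.
print(round(bmi(70.0, 1.75), 1))  # 22.9
```

Computing BMI from weight and height in a single place, rather than accepting it as a separately recorded field, is one way to keep the three related features consistent when they are all present in a dataset.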

SBP (Systolic Blood Pressure)
SBP is the pressure that the blood exerts against the artery walls when the heart beats. SBP shows the highest frequency in predicting CVD and DR. SBP values greater than 141 mmHg are more common in patients with diabetes than in those without [66]. SBP values are categorised differently in the literature. Bragg et al. used three categories of SBP (<120, 120-140, >=141) [66], and Ku et al. created four categories (<120, 120-<131, 131-<141, >=141) [51]. Higher SBP has been selected as a significant predictor in previous studies [65,66]. An apparent reduction in SBP (-6.31 mmHg) was associated with an increase in "moderate to vigorous-intensity physical activity" among individuals who increased their physical activity over four years of follow-up [81]. SBP was considered a risk factor in QRISK2, a well-known algorithm for predicting the risk of developing cardiovascular disease over the next 10 years. QRISK3, the upgraded version of QRISK2, used SBP and the variability of SBP as two risk factors. Although including the standard deviation of blood pressure did not improve discrimination and calibration in predicting CVD, SBP and its variability were independently associated with an increased risk of CVD [6]. A higher mean value of SBP has been established as a risk factor in predicting any CVD [77]. Rørth et al. also concluded that considering SBP as a time-varying covariate did not show a disparity between patients with and without diabetes. High mean SBP can be seen in patients with T2DM and foot ulceration, while no significant disparity can be seen in T1DM. SBP has been recognised as the top risk factor for predicting CAN [3].
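Turning a continuous SBP reading into one of the reported categories is a simple binning step. The sketch below uses the four-category scheme attributed to Ku et al. in the text; the function and constant names are hypothetical, and applying these cut-offs to any other dataset is an assumption that would need checking against that study's definitions.

```python
from bisect import bisect_right

# Cut-offs (mmHg) for the four categories reported in the text for Ku et al.
SBP_EDGES = [120, 131, 141]
SBP_LABELS = ["<120", "120-<131", "131-<141", ">=141"]


def sbp_category(sbp_mmhg):
    """Map an SBP reading (mmHg) to its category label."""
    # bisect_right counts how many cut-offs are <= the reading,
    # so a reading equal to a cut-off falls into the higher category.
    return SBP_LABELS[bisect_right(SBP_EDGES, sbp_mmhg)]


for reading in (118, 120, 135, 141):
    print(reading, sbp_category(reading))
```

Because authors use different cut-offs, keeping the edges and labels as data rather than hard-coded branches makes it straightforward to swap in another study's scheme.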

HbA1c
HbA1c is a well-established measurement for diagnosing DM that reflects the average blood glucose level over the preceding two to three months. A strong relationship between HbA1c and CVD has been reported [14]. Furthermore, Andersen et al. concluded that "the rate of HbA1c increase affects the development of diabetic polyneuropathy over and above the effect of the HbA1c level" [15]. One study revealed a paradoxical finding of a gradual increase in HbA1c among patients who showed better baseline glycaemic control [22]. Moreover, a higher mean value of HbA1c is recognised as the greatest risk factor for predicting proliferative DR and CSME [5]. HbA1c has also been identified as the strongest risk factor for the progression of DR [5]. A U-shaped relationship has been found between HbA1c and mortality, where modest glycaemic control (HbA1c 7.1-8.0%) shows the lowest mortality risk in patients with T2DM and CHF [74]. Some researchers treat HbA1c as a categorical factor. Elder et al. categorised it into three classes (<=7.0, 7.1-9.0, and >9.0), while five classes (<6, 6.1-7.0, 7.1-8.0, 8.1-9.0, and >9) were used to separate the participants at baseline [74]. Other categorisations include (6.5-8.0%, <6.5%, >8.0%) [22] and (<=7.0% or <=54 mmol/mol; 7.1-8.0% or 55-65 mmol/mol; >8.0% or >65 mmol/mol; unknown) [36].

Smoking History
Smoking history is a significant behavioural feature frequently used in risk model creation for all four considered complications. This feature is ranked among the top three features in CVD and DNeu. Smoking history data have been collected under various categories. Table 4 presents the existing categorisations used in collecting, analysing, and reporting smoking history in the literature.
A progressive increase in CVD risk with smoking status among patients with diabetes has been confirmed by Bragg et al. [66]. However, a few researchers stated that there is no statistically significant association between the risk of DPN and smoking [13,15]. Furthermore, smoking was not selected as a significant independent risk factor for predicting CKD among patients with diabetes in two models created by Penno et al. [55].

DBP (Diastolic Blood Pressure)
DBP is the pressure in the arteries when the heart rests between beats. The standard value of DBP is 80 mmHg. DBP ranks higher in CVD and DNeu than in the other two complications. Elevated DBP is linked to higher rates of kidney function decline in diabetic kidney disease [41-43,53,54,59]. An association between specific DBP patterns and glomerular filtration rate has been recognised among patients with DNeph [59]. Moreover, DBP is linked to early cardiac dysfunction in diabetes [70,72,84,90]. DBP is also considered sex-specific in predicting occlusive vascular and mortality outcomes in diabetes [14]. Scholars found evidence of an association between diuretic use, increased DBP, and lower limb events in type 2 diabetes [23,27], while the TODAY study confirmed that metabolic factors, race-ethnicity, and sex influence nephropathy development in relation to DBP [64].

Discussion on Features used in Predicting CoDM
As a result of this study, we identified 59 features frequently used as predictors in CoDM. Nineteen features are recognised as common to all four considered complications: age, gender, ethnicity, weight, height, BMI, smoking history, HbA1c, SBP, eGFR, DBP, HDL, LDL, total cholesterol, triglyceride, use of insulin, duration of diabetes, family history of CVD, and family history of diabetes. Although we consider the features frequently used in the literature, some features may be highly important in predicting CoDM without being used as frequently as the traditional features. Genetic risk factors are a good example of features that are not commonly used to design risk prediction models but possess significant predictive power. Recently, there has been a trend towards using genetic factors [94] and biomarkers [63] for predicting CoDM. Single nucleotide polymorphisms (SNP) are widely used as a genetic factor for predicting CVD [94]. Furthermore, the biomarkers TNFR-1, TNFR-2, and KIM1 have been used in predicting diabetic nephropathy. Moreover, the diagnosis and prognosis of diabetic retinopathy have been carried out with image processing techniques. Therefore, the features used for this complication differ from the rest. Retinal arterial calibre (CRAE) [31,41], arteriolar tortuosity, and fractal dimensions are commonly used in predicting DR with image processing techniques [39]. The effect of retinal venular tortuosity and fractal dimensions in predicting incident retinopathy has been explored to understand their predictive capability [38]. The feature categories, the terminology used, and how features are used vary among scholars. The usage of features in prediction models is hard to generalise, since data availability, the nature of data sources, the focus of the risk model, and the selected ML techniques differ hugely.

Research Implications
The findings of this research study have academic and clinical implications. The identified feature set for CoDM is highly useful for researchers who predict CoDM with statistical and computerised prediction models. By systematically extracting and analysing frequently used risk factors from the existing literature, the review sheds light on the critical determinants contributing to the development and progression of complications associated with diabetes. The feature set can be used as a quick reference in the feature selection and feature engineering phases of model creation. Identifying these risk factors not only advances our understanding of the multifaceted nature of diabetes complications but also provides a comprehensive framework for future research endeavours. Furthermore, academics can use the feature set to validate and support the credibility of their feature selection, not only against individual studies but also against a thorough literature review, which enhances the model's acceptability. Moreover, the provided list of articles that use the identified features in each complication guides academics in effectively retrieving the relevant papers for future studies.

Clinical Implications
Clinicians, general practitioners, policymakers, and other stakeholders in the healthcare industry can use the findings of this research to update their knowledge of the domain. The most frequently used feature set and the brief descriptions of selected features provide state-of-the-art, up-to-date knowledge that is beneficial in disease diagnosis and decision-making. Clinicians and healthcare professionals can utilise the identified risk factors as valuable tools to assess the likelihood of complications in diabetic patients, enabling personalised and targeted intervention strategies. By integrating these risk factors into clinical assessments, healthcare providers can proactively identify patients at higher risk of nephropathy, neuropathy, cardiovascular disease (CVD), and retinopathy, facilitating early intervention and more effective complication prevention. The importance of each feature in predicting the complications gives clinicians a direction for further investigation to support informed decisions. The current study's findings can be adapted to future risk prediction, disease diagnosis, and prognosis models. Additionally, the overall outcome of the study provides a thorough state of the art, which keeps clinicians and other stakeholders updated with the domain knowledge.

CONCLUSION
Disease prognosis based on ML algorithms with various risk factors has become a prominent research area due to its convenience and cost-effectiveness. Risk scoring models are frequently created for non-invasive or minimally invasive risk prognosis. Since model accuracy and reliability depend highly on the selected feature set, choosing the most appropriate feature set is vital prior to model design. This systematic review focuses on extracting the most frequently used attributes for predicting the complications of diabetes mellitus by utilising the features of EHR. Due to their elevated and irreversible health effects, DNeu, DNeph, DR, and CVD are selected as severe CoDM. After searching the related articles from the most recent eight years in top-ranked journals, the most common feature set for each category of complication was selected. According to the results, fifty-nine features have been chosen as frequent features for predicting the selected CoDM. Among them, age, gender, ethnicity, weight, height, BMI, smoking history, HbA1c, SBP, eGFR, DBP, HDL, LDL, total cholesterol, triglyceride, use of insulin, duration of diabetes, family history of CVD, and family history of diabetes are determined as the persistent features in the prediction of CoDM. The identified feature set provides a meaningful exploration of the state-of-the-art features for use in the feature selection phase of risk prediction model design. Furthermore, future researchers can use this feature set to validate their selected feature sets against the state of the art. The extracted feature set can be used in future research to identify the most frequent features in predicting each CoDM. The valuable implications of this study are the effectiveness of adapting the identified feature set in future studies and the capability of benchmarking the feature set of individual designs against an extracted feature set resulting from a properly performed systematic review. Because features are used with different scaling systems, categories, and purposes across studies, it is worthwhile to provide a platform for discussing the similarities and variations of the features used. The given introduction of the selected features provides a concise description of the usage of each feature in past studies, which is vital in feature selection and feature engineering. The outcomes of this systematic review help academics and stakeholders in the healthcare sector to understand the domain and make informed decisions. In summary, the systematic review has implications for both the research and clinical domains. By systematically uncovering frequently utilised risk factors, the review not only advances our understanding of diabetes complications but also empowers researchers and clinicians alike to collaboratively shape the future of diabetes management, prevention, and patient care.

AUTHORS' CONTRIBUTIONS
M. A. and YCW wrote the main manuscript text; CCL, MM, and WYC did the editorial work; CCL provided ideas from the clinical perspective; and YCW did the final proofreading and submission.

STANDARDS OF REPORTING
PRISMA guidelines and methodology were followed.
PRISMA checklist is available on the publisher's website.
Finally, the articles were divided into four groups according to the complication type. The flow diagram of the article selection of the study is illustrated in Fig. (1).